A threshold in a continuous variable determines treatment assignment
Running variable (or “score”), X: a continuously distributed variable with a clearly defined cutoff (c) that determines which units are assigned to treatment and which ones are assigned to control.
Continuity assumption: in the absence of treatment, expected outcomes would evolve smoothly across the cutoff, so units just below and just above c are comparable. Relatedly, the density of the running variable should show no jump around the cutoff value, which would indicate manipulation of the score.
sharp RDD: the threshold separates the treatment and control group exactly
fuzzy RDD: the threshold influences the probability of being treated. this is in fact an instrumental variable approach (estimating a Local Average Treatment Effect [LATE])
The value of the outcome (Y) for individuals just below the threshold serves as the missing counterfactual outcome: the outcome varies continuously with the running variable, whereas treatment status changes discontinuously at the cutoff.
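In a sharp design, this logic gives the standard RDD estimand: the jump in the conditional expectation of the outcome at the cutoff,

\[ \tau_\text{SRD} = \lim_{x \downarrow c} E[Y \mid X = x] - \lim_{x \uparrow c} E[Y \mid X = x] \]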
With an RDD approach, several of these assumptions can be tested: individuals or units close to the threshold should be nearly identical, except for characteristics affected by the treatment. This testability is one of RDD's main strengths, and we check the assumptions empirically below.
Load required packages. In this demo, we will use the packages "rdrobust" and "rddensity".
install.packages("rdrobust",
repos = "http://cran.us.r-project.org")
install.packages("rddensity",
repos = "http://cran.us.r-project.org")
library(tidyverse) # ggplot(), %>%, mutate(), and friends
# library(broom) # Convert models to data frames
library(rdrobust) # For robust nonparametric regression discontinuity
library(rddensity) # For nonparametric regression discontinuity density tests
library(modelsummary) # Create side-by-side regression tables
x <- runif(1000, -1, 1)                   # running variable
y <- 5 + 3*x + 2*(x >= 0) + rnorm(1000)   # outcome with a cutoff c at 0
rdrobust(y, x)  # apply rdrobust to view sample output
## Call: rdrobust
##
## Number of Obs. 1000
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 515 485
## Eff. Number of Obs. 117 126
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 0.248 0.248
## BW bias (b) 0.424 0.424
## rho (h/b) 0.585 0.585
## Unique Obs. 515 485
We are interested in the causal effect of fishing on crab mortality. This question is surprisingly difficult to answer because natural mortality is hard to separate from fishing mortality, yet the distinction matters for fisheries management.
### Make column for observation ID
crab_data <- data.frame(id = seq(1, 1000),
### Add columns for explanatory variables
### Add column for treatment variable
fished = c(rep(0, 500), rep(1, 500)),
### Add column for body size
size = c(runif(500, min = 10, max = 80),
runif(500, min = 60, max = 130)),
### And the rest of the covariates
sst = runif(1000, min = 10, max = 16),
salinity = runif(1000, min = 1, max = 10),
npp = runif(1000, min = 0, max = 1),
### And the error term
error = rnorm(1000, mean = 0, sd = 2))
### Visualize
ggplot(crab_data) +
geom_boxplot(aes(x = size, group = fished)) +
geom_vline(xintercept = 70)
### center size at legal catch size
crab_data <- crab_data %>%
mutate(size_centered = size - 70)
### Make column for outcome variable (total mortality)
crab_data <- crab_data %>%
mutate(natural_mortality = 10 + -0.4*size_centered + 2*sst + 0.3*npp + 1.5*salinity + error,
fishing_mortality = ifelse(size_centered < 0 & fished == 0,
10 + -0.1*size_centered,
ifelse(size_centered < 0 & fished == 1,
30 + 0.4*size_centered,
ifelse(size_centered > 0 & fished == 0,
10 + -0.1*size_centered,
30 + 0.4*size_centered))),
total_mortality = natural_mortality + fishing_mortality)
### Visualize mortality outcomes
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = natural_mortality,
color = fished))
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = fishing_mortality,
color = fished))
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = total_mortality,
color = fished))
### Visualize covariates
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = sst,
color = fished))
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = salinity,
color = fished))
ggplot(crab_data) +
geom_point(aes(x = size_centered,
y = npp,
color = fished))
Example: a rule-based cutoff, fishing size limits for crabs
To be legally harvested, crabs must be larger than 70 mm; smaller crabs may not be taken. Harvesting is thus rule-based, according to size. We would normally expect mortality to decline with body size, but the harvest rule changes this relationship at the discontinuity.
Since we know that harvest was assigned based on a rule, we next want to figure out how strictly the rule was applied. Was there a lot of illegal harvest of crabs that were too small? Were all crabs over 70 mm harvested? Did crabs measuring, say, 68 mm slip through the cracks? The easiest way to check this is with a graph, and we can get exact numbers with a table.
ggplot(crab_data, aes(x = size_centered, y = fished, color = fished)) +
# Make points small and semi-transparent since there are lots of them
geom_point(size = 0.5, alpha = 0.5,
position = position_jitter(width = 0, height = 0.25, seed = 1234)) +
# Add vertical line
geom_vline(xintercept = 0) +
# Add labels
labs(x = "Size (centered on 70 as cut-off)", y = "Fished or not") +
# Turn off the color legend, since it's redundant
guides(color = "none")
This discontinuity looks fuzzy rather than sharp; we verify this in the next step.
We confirm the conclusion from the visualization above with a table that checks whether some crabs under 70 mm were harvested and whether some over 70 mm were not.
crab_data %>%
group_by(fished, size_centered <= 0) %>%
summarize(count = n())
## `summarise()` has grouped output by 'fished'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: fished [2]
## fished `size_centered <= 0` count
## <dbl> <lgl> <int>
## 1 0 FALSE 81
## 2 0 TRUE 419
## 3 1 FALSE 437
## 4 1 TRUE 63
Based on the table, we can conclude this is a fuzzy design, because some crabs over the cutoff were not fished and some under it were fished.
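A quick way to quantify this fuzziness is the compliance rate: the share of crabs whose harvest status matches the rule-based assignment. From the table, that is (419 + 437) / 1000 = 0.856. A small sketch computing it directly from the crab_data frame built above:

```r
library(dplyr)

crab_data %>%
  # Rule-based assignment: crabs above the cutoff are legal to harvest
  mutate(assigned = as.numeric(size_centered > 0)) %>%
  # Share of crabs whose actual harvest status matches their assignment
  summarize(compliance_rate = mean(fished == assigned))
```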
Next, we need to see whether any crabs self-selected onto either side of the size threshold; this seems biologically unlikely. First, we'll make a histogram of the running variable (size) and check for any big jumps around the threshold:
ggplot(crab_data, aes(x = size_centered, fill = factor(fished))) +
# factor(fished) avoids the grouping warning ggplot gives for a numeric fill
geom_histogram(binwidth = 2, color = "white", boundary = 0) +
geom_vline(xintercept = 0) +
labs(x = "Size (centered)", y = "Count", fill = "Fished")
Here it doesn’t look like there’s a jump around the cutoff—there’s a tiny visible difference between the height of the bars right before and right after the 70-mm sie threshold (centered at 0), but it seems to follow the general shape of the overall distribution.
There’s a fuzzy discontinuity, but how big is it? And is it statistically significant?
We can check the size of the discontinuity two different ways: parametrically (i.e. using lm() with an assumed functional form and estimated coefficients) and nonparametrically (i.e. using local regression near the cutoff instead of imposing a single straight line on all the data). We'll do it both ways.
First we’ll do it parametrically by using linear regression. Here we want to explain the variation in final scores based on the entrance exam score and participation in the tutoring program:
\[ \text{Exit exam} = \beta_0 + \beta_1 \text{Entrance exam score}_\text{centered} + \beta_2 \text{Tutoring program} + \epsilon \]
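Applied to the crab data, a minimal parametric sketch with lm(); the ±10 mm bandwidth is an arbitrary illustrative choice, trading bias against variance:

```r
library(dplyr)

# Parametric estimate of the gap at the cutoff, within a bandwidth
model_simple <- lm(total_mortality ~ size_centered + fished,
                   data = filter(crab_data, abs(size_centered) <= 10))
summary(model_simple)
```

The coefficient on fished estimates the jump in total mortality at the cutoff.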
Instead of using linear regression to measure the size of the discontinuity, we can use nonparametric methods. Essentially this means R will not fit one straight line to all the data; instead it fits local regressions to the observations near the cutoff, following the shape of the data more closely.
The rdrobust() function makes it really easy to measure the gap at the cutoff with nonparametric estimation. Here’s the simplest version:
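Applied to the crab data, the simplest sharp call looks like this (a sketch; summary() prints the conventional, bias-corrected, and robust estimates):

```r
library(rdrobust)

# Nonparametric estimate of the discontinuity in total mortality at the cutoff
rdrobust(y = crab_data$total_mortality,
         x = crab_data$size_centered,
         c = 0) %>%
  summary()
```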
This is likely a fuzzy RDD setup: some crabs under 70 mm are probably harvested illegally, and fishers do not catch every crab over 70 mm, so the treatment assigned is not necessarily the treatment received. To estimate the effect of fishing on total crab mortality in a fuzzy RDD, we take a two-stage least squares approach, using assignment (crossing the size cutoff) as an instrument for actual harvest.
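rdrobust() implements this two-stage approach directly: passing the treatment actually received to its fuzzy argument makes crossing the cutoff the instrument for harvest. A sketch using the crab_data frame from above:

```r
library(rdrobust)

# Fuzzy RDD: instrument actual harvest (fished) with assignment at the cutoff
rdrobust(y = crab_data$total_mortality,
         x = crab_data$size_centered,
         c = 0,
         fuzzy = crab_data$fished) %>%
  summary()
```

The reported effect is a local average treatment effect (LATE) for compliers near the cutoff.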
Modified from The Mixtape by Scott Cunningham: https://mixtape.scunning.com/06-regression_discontinuity
library(tidyverse)
# simulate the data
dat <- tibble(
x = rnorm(1000, 50, 25)
) %>%
mutate(
x = if_else(x < 0, 0, x)
) %>%
filter(x < 100)
# cutoff at x = 70, no jump
dat <- dat %>%
mutate(
D = if_else(x > 70, 1, 0),
y1 = 25 + 0 * D + 1.5 * x + rnorm(n(), 0, 20)
)
ggplot(aes(x, y1, colour = factor(D)), data = dat) +
geom_point(alpha = 0.5) +
geom_vline(xintercept = 70, colour = "grey", linetype = 2)+
stat_smooth(method = "lm", se = F) +
labs(x = "Test score (X)", y = "Potential Outcome (Y1)")
## `geom_smooth()` using formula = 'y ~ x'
## simulate the discontinuity
dat <- dat %>%
mutate(
y2 = 25 + 40 * D + 1.5 * x + rnorm(n(), 0, 20)
)
# figure
ggplot(aes(x, y2, colour = factor(D)), data = dat) +
geom_point(alpha = 0.5) +
geom_vline(xintercept = 70, colour = "grey", linetype = 2) +
stat_smooth(method = "lm", se = F) +
labs(x = "Test score (X)", y = "Potential Outcome (Y)")
## `geom_smooth()` using formula = 'y ~ x'
# simulate nonlinearity
dat <- tibble(
x = rnorm(1000, 100, 50)
) %>%
mutate(
x = case_when(x < 0 ~ 0, TRUE ~ x),
D = case_when(x > 140 ~ 1, TRUE ~ 0),
x2 = x*x,
x3 = x*x*x,
y3 = 10000 + 0 * D - 100 * x + x2 + rnorm(1000, 0, 1000)
) %>%
filter(x < 280)
ggplot(aes(x, y3, colour = factor(D)), data = dat) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 140, colour = "grey", linetype = 2) +
stat_smooth(method = "lm", se = F) +
labs(x = "Test score (X)", y = "Potential Outcome (Y)")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(aes(x, y3, colour = factor(D)), data = dat) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = 140, colour = "grey", linetype = 2) +
stat_smooth(method = "loess", se = F) +
labs(x = "Test score (X)", y = "Potential Outcome (Y)")
## `geom_smooth()` using formula = 'y ~ x'